Linear regression is an analysis that assesses whether one or more predictor variables explain variation in the dependent (criterion) variable.
knitr::include_graphics("https://miro.medium.com/max/2872/1*k2bLmeYIG7z7dCyxADedhQ.png")
The regression has five key assumptions:

- A linear relationship between the predictors and the response
- Multivariate normality of the residuals
- No or little multicollinearity among the predictors
- No autocorrelation of the residuals
- Homoscedasticity (constant variance of the residuals)

First, let's install and load the required packages.
install.packages(c('readr', 'ggplot2', 'mlbench', 'corrplot', 'Amelia',
                   'caret', 'plotly', 'caTools', 'reshape2', 'dplyr'))
library(readr)
library(ggplot2)
library(corrplot)
library(mlbench)
library(Amelia)
library(plotly)
library(reshape2)
library(caret)
library(caTools)
library(dplyr)
The housing data contains 506 census tracts of Boston from the 1970 census. The dataframe BostonHousing contains the original data by Harrison and Rubinfeld (1979); the dataframe BostonHousing2 contains the corrected version with additional spatial information.
You can load this data by installing the mlbench package. The data has the following features, with medv being the target variable:
##Data

Load the BostonHousing data and assign it to the variable housing.
data(BostonHousing)
housing <- BostonHousing
str(housing)
## 'data.frame': 506 obs. of 14 variables:
## $ crim : num 0.00632 0.02731 0.02729 0.03237 0.06905 ...
## $ zn : num 18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
## $ indus : num 2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
## $ chas : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ nox : num 0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
## $ rm : num 6.58 6.42 7.18 7 7.15 ...
## $ age : num 65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
## $ dis : num 4.09 4.97 4.97 6.06 6.06 ...
## $ rad : num 1 2 2 3 3 3 5 5 5 5 ...
## $ tax : num 296 242 242 222 222 222 311 311 311 311 ...
## $ ptratio: num 15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
## $ b : num 397 397 393 395 397 ...
## $ lstat : num 4.98 9.14 4.03 2.94 5.33 ...
## $ medv : num 24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
Let's examine the head of the housing dataframe using head().
head(housing)
## crim zn indus chas nox rm age dis rad tax ptratio b lstat
## 1 0.00632 18 2.31 0 0.538 6.575 65.2 4.0900 1 296 15.3 396.90 4.98
## 2 0.02731 0 7.07 0 0.469 6.421 78.9 4.9671 2 242 17.8 396.90 9.14
## 3 0.02729 0 7.07 0 0.469 7.185 61.1 4.9671 2 242 17.8 392.83 4.03
## 4 0.03237 0 2.18 0 0.458 6.998 45.8 6.0622 3 222 18.7 394.63 2.94
## 5 0.06905 0 2.18 0 0.458 7.147 54.2 6.0622 3 222 18.7 396.90 5.33
## 6 0.02985 0 2.18 0 0.458 6.430 58.7 6.0622 3 222 18.7 394.12 5.21
## medv
## 1 24.0
## 2 21.6
## 3 34.7
## 4 33.4
## 5 36.2
## 6 28.7
summary(housing)
## crim zn indus chas nox
## Min. : 0.00632 Min. : 0.00 Min. : 0.46 0:471 Min. :0.3850
## 1st Qu.: 0.08205 1st Qu.: 0.00 1st Qu.: 5.19 1: 35 1st Qu.:0.4490
## Median : 0.25651 Median : 0.00 Median : 9.69 Median :0.5380
## Mean : 3.61352 Mean : 11.36 Mean :11.14 Mean :0.5547
## 3rd Qu.: 3.67708 3rd Qu.: 12.50 3rd Qu.:18.10 3rd Qu.:0.6240
## Max. :88.97620 Max. :100.00 Max. :27.74 Max. :0.8710
## rm age dis rad
## Min. :3.561 Min. : 2.90 Min. : 1.130 Min. : 1.000
## 1st Qu.:5.886 1st Qu.: 45.02 1st Qu.: 2.100 1st Qu.: 4.000
## Median :6.208 Median : 77.50 Median : 3.207 Median : 5.000
## Mean :6.285 Mean : 68.57 Mean : 3.795 Mean : 9.549
## 3rd Qu.:6.623 3rd Qu.: 94.08 3rd Qu.: 5.188 3rd Qu.:24.000
## Max. :8.780 Max. :100.00 Max. :12.127 Max. :24.000
## tax ptratio b lstat
## Min. :187.0 Min. :12.60 Min. : 0.32 Min. : 1.73
## 1st Qu.:279.0 1st Qu.:17.40 1st Qu.:375.38 1st Qu.: 6.95
## Median :330.0 Median :19.05 Median :391.44 Median :11.36
## Mean :408.2 Mean :18.46 Mean :356.67 Mean :12.65
## 3rd Qu.:666.0 3rd Qu.:20.20 3rd Qu.:396.23 3rd Qu.:16.95
## Max. :711.0 Max. :22.00 Max. :396.90 Max. :37.97
## medv
## Min. : 5.00
## 1st Qu.:17.02
## Median :21.20
## Mean :22.53
## 3rd Qu.:25.00
## Max. :50.00
Next we have to check this data for missing values. There are many ways to do this; I will use missmap() from the Amelia package.
missmap(housing,col=c('yellow','black'),y.at=1,y.labels='',legend=TRUE)
The above plot clearly shows that the data is free of NAs.
Let's use ggplot2, corrplot and plotly to explore the data a bit.
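The same conclusion can be cross-checked numerically, without a plot:

```r
# Numeric check: count missing values per column (all should be 0)
colSums(is.na(housing))
# A single overall check; FALSE confirms there are no missing values
any(is.na(housing))
```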
Correlation and CorrPlots
From Wikipedia, correlation is defined as:
In statistics, dependence or association is any statistical relationship, whether causal or not, between two random variables or two sets of data. Correlation is any of a broad class of statistical relationships involving dependence, though in common usage it most often refers to the extent to which two variables have a linear relationship with each other.
Correlation plots are a great way of exploring data and seeing how strongly the predictors are related to the target and to each other.
cor(select(housing,-chas))
## crim zn indus nox rm age
## crim 1.0000000 -0.2004692 0.4065834 0.4209717 -0.2192467 0.3527343
## zn -0.2004692 1.0000000 -0.5338282 -0.5166037 0.3119906 -0.5695373
## indus 0.4065834 -0.5338282 1.0000000 0.7636514 -0.3916759 0.6447785
## nox 0.4209717 -0.5166037 0.7636514 1.0000000 -0.3021882 0.7314701
## rm -0.2192467 0.3119906 -0.3916759 -0.3021882 1.0000000 -0.2402649
## age 0.3527343 -0.5695373 0.6447785 0.7314701 -0.2402649 1.0000000
## dis -0.3796701 0.6644082 -0.7080270 -0.7692301 0.2052462 -0.7478805
## rad 0.6255051 -0.3119478 0.5951293 0.6114406 -0.2098467 0.4560225
## tax 0.5827643 -0.3145633 0.7207602 0.6680232 -0.2920478 0.5064556
## ptratio 0.2899456 -0.3916785 0.3832476 0.1889327 -0.3555015 0.2615150
## b -0.3850639 0.1755203 -0.3569765 -0.3800506 0.1280686 -0.2735340
## lstat 0.4556215 -0.4129946 0.6037997 0.5908789 -0.6138083 0.6023385
## medv -0.3883046 0.3604453 -0.4837252 -0.4273208 0.6953599 -0.3769546
## dis rad tax ptratio b lstat
## crim -0.3796701 0.6255051 0.5827643 0.2899456 -0.3850639 0.4556215
## zn 0.6644082 -0.3119478 -0.3145633 -0.3916785 0.1755203 -0.4129946
## indus -0.7080270 0.5951293 0.7207602 0.3832476 -0.3569765 0.6037997
## nox -0.7692301 0.6114406 0.6680232 0.1889327 -0.3800506 0.5908789
## rm 0.2052462 -0.2098467 -0.2920478 -0.3555015 0.1280686 -0.6138083
## age -0.7478805 0.4560225 0.5064556 0.2615150 -0.2735340 0.6023385
## dis 1.0000000 -0.4945879 -0.5344316 -0.2324705 0.2915117 -0.4969958
## rad -0.4945879 1.0000000 0.9102282 0.4647412 -0.4444128 0.4886763
## tax -0.5344316 0.9102282 1.0000000 0.4608530 -0.4418080 0.5439934
## ptratio -0.2324705 0.4647412 0.4608530 1.0000000 -0.1773833 0.3740443
## b 0.2915117 -0.4444128 -0.4418080 -0.1773833 1.0000000 -0.3660869
## lstat -0.4969958 0.4886763 0.5439934 0.3740443 -0.3660869 1.0000000
## medv 0.2499287 -0.3816262 -0.4685359 -0.5077867 0.3334608 -0.7376627
## medv
## crim -0.3883046
## zn 0.3604453
## indus -0.4837252
## nox -0.4273208
## rm 0.6953599
## age -0.3769546
## dis 0.2499287
## rad -0.3816262
## tax -0.4685359
## ptratio -0.5077867
## b 0.3334608
## lstat -0.7376627
## medv 1.0000000
#Plotting Correlation
corrplot(cor(select(housing,-chas)))
medv decreases as crim (medium), indus (high), nox (low), age (low), rad (low), tax (low), ptratio (high) and lstat (high) increase, and increases with zn (low) and rm (high); the strength of each correlation is noted in parentheses.
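To make that ranking explicit, the correlations with medv can be sorted by absolute value:

```r
# Rank predictors by the absolute value of their correlation with medv
cors <- cor(select(housing, -chas))[, "medv"]
sort(abs(cors[names(cors) != "medv"]), decreasing = TRUE)
```

lstat and rm come out on top, matching what the corrplot shows visually.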
####medv Density Plot using ggplot2
#visualizing the distribution of the target variable
housing %>%
ggplot(aes(medv)) +
stat_density() +
theme_bw()
The above visualization reveals that the peak density of medv lies between 15 and 30.
####medv Density using plotly
ggplotly(housing %>%
ggplot(aes(medv)) +
stat_density() +
theme_bw())
####medv vs Other Variables

Let's see the effect of the other variables in the dataframe on medv.
housing %>%
select(c(crim, rm, age, rad, tax, lstat, medv,indus,nox,ptratio,zn)) %>%
melt(id.vars = "medv") %>%
ggplot(aes(x = value, y = medv, colour = variable)) +
geom_point(alpha = 0.7) +
stat_smooth(colour = "black") +
facet_wrap(~variable, scales = "free", ncol = 2) +
labs(x = "Variable Value", y = "Median House Price ($1000s)") +
theme_minimal()
The patterns in the above graph are consistent with the corrplot.
##Model Building & Prediction
###General Form
The general linear regression model in R:

Univariate model: `model <- lm(y ~ x, data)`

Multivariate model: `model <- lm(y ~ ., data)`
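As a minimal sketch of the univariate form, using rm (average rooms per dwelling) as the single predictor:

```r
# Univariate example: medv modelled on average room count alone
uni <- lm(medv ~ rm, data = housing)
coef(uni)  # intercept and slope for rm
```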
###Train and Test Data
Let's split the data into train and test sets using the caTools package.
#set a seed
set.seed(123)
#Split the data; sample.split() returns a logical vector based on the SplitRatio specified.
split <- sample.split(housing,SplitRatio =0.75)
train <- subset(housing,split==TRUE)
test <- subset(housing,split==FALSE)
# train <- select(train,-b)
# test <- select(test,-b)
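An alternative split, using caret's createDataPartition() (which stratifies on the target variable), would look like this sketch; it is not used in the model below:

```r
# Alternative stratified split with caret (sketch only)
set.seed(123)
idx <- createDataPartition(housing$medv, p = 0.75, list = FALSE)
train_alt <- housing[idx, ]
test_alt  <- housing[-idx, ]
```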
###Training our Model

Let's build our model, treating crim, rm, tax and lstat as the major influences on the target variable.
Test for significance of regression: H0: β1 = β2 = ⋯ = βk = 0; H1: βj ≠ 0 for at least one j. Note that under H0 the intercept β0 can still be non-zero: the model reduces to y = β0 + e.
The coefficient of determination, R², measures the proportion of variance in the target explained by the model:
knitr::include_graphics("https://www.gstatic.com/education/formulas2/355397047/en/coefficient_of_determination.svg")
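As an illustration of the formula above, R² can be computed by hand from the residual and total sums of squares and checked against summary(); a simple univariate fit on rm serves as the sketch here:

```r
# Illustrative sketch: computing R^2 by hand for a fitted lm object
m <- lm(medv ~ rm, data = housing)                     # simple univariate fit
ss_res <- sum(residuals(m)^2)                          # residual sum of squares
ss_tot <- sum((housing$medv - mean(housing$medv))^2)   # total sum of squares
r2 <- 1 - ss_res / ss_tot
all.equal(r2, summary(m)$r.squared)                    # agrees with summary()
```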
model <- lm(medv ~ crim + rm + tax + lstat , data = train)
summary(model)
##
## Call:
## lm(formula = medv ~ crim + rm + tax + lstat, data = train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -16.266 -3.185 -1.052 2.116 30.121
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -3.767079 3.573477 -1.054 0.29251
## crim -0.070793 0.037113 -1.908 0.05725 .
## rm 5.580390 0.492854 11.323 < 2e-16 ***
## tax -0.006392 0.002114 -3.023 0.00268 **
## lstat -0.483836 0.058230 -8.309 2.04e-15 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.183 on 357 degrees of freedom
## Multiple R-squared: 0.6816, Adjusted R-squared: 0.678
## F-statistic: 191.1 on 4 and 357 DF, p-value: < 2.2e-16
###Visualizing our Model

Let's visualize our linear regression model by plotting the residuals. The difference between the observed value of the dependent variable (y) and the predicted value (ŷ) is called the residual (e).
res <- residuals(model)
# Convert residuals to a DataFrame
res <- as.data.frame(res)
ggplot(res,aes(res)) + geom_histogram(fill='blue',alpha=0.5)
plot(model)
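By default plot(model) draws the four diagnostic plots one at a time; a 2×2 panel shows them together:

```r
# Display all four diagnostic plots in a single 2x2 panel
par(mfrow = c(2, 2))
plot(model)
par(mfrow = c(1, 1))  # reset the plotting layout
```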
###Predictions
Let’s test our model by predicting on our testing dataset.
test$predicted.medv <- predict(model,test)
pl1 <-test %>%
ggplot(aes(medv,predicted.medv)) +
geom_point(alpha=0.5) +
stat_smooth(colour = 'black') +
xlab('Actual value of medv') +
ylab('Predicted value of medv')+
theme_bw()
ggplotly(pl1)
Let's evaluate our model using the Root Mean Square Error (RMSE), a standard measure of how far our predicted values deviate from the observed values.
###Assessing our Model
error <- test$medv - test$predicted.medv
#error
rmse <- sqrt(mean(error^2))
rmse
Note the placement of the squaring: sqrt(mean(error^2)) is the RMSE, whereas sqrt(mean(error)^2) would merely return the absolute value of the mean error and badly understate the typical miss. The RMSE is in the same units as medv ($1000s), and the results can be further improved through feature selection and retraining the model.
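Since caret is already loaded, its postResample() helper offers a convenient cross-check, reporting RMSE alongside R² and MAE on the held-out set:

```r
# Cross-check RMSE with caret; also reports R-squared and MAE
postResample(pred = test$predicted.medv, obs = test$medv)
```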